Thank you for the kind introduction. Today I'm going to talk about my recent work on representation learning to model and classify pathological speech. I have been in the lab since May 2018, so this is my fourth PRS, and since the last PRS in summer I have had some publications; these are some of them: a paper at CIARP, which received the best paper award, my recent paper at Interspeech, and my accepted paper at ICASSP. We already participated last year in the Interspeech hackathon creating Alexa skills, so for this presentation an Alexa skill is going to help me during my talk. And today I'm going to talk to you about a
recent paper submitted to Speech Communication about parallel representation learning for the classification of pathological speech.

So, pathological speech processing has focused on diseases with different origins: patients with laryngeal cancer, polyps, or nodules; patients with morphological disorders like cleft lip and palate; and patients with neurological diseases like Parkinson's disease, Huntington's disease, or Alzheimer's disease. Clinical observations in the speech of patients can be measured and objectively analyzed with the aim of addressing two main problems. The first one is to support the diagnosis of the disease by classifying between healthy control subjects and patients. The second one, once the patient is diagnosed, is to predict the level of degradation of the speech of the patient according to a specific clinical scale measuring intelligibility, articulation, among others.

So, my main claim is that the general handcrafted features extracted in related studies may not capture enough information to characterize the presence of pathological disorders that affect different aspects of the speech production system. Classical features addressed in the literature include phonation measures regarding perturbations of the vocal fold vibration, articulation measures regarding formant frequencies and the different resonances in the vocal tract, prosody features regarding fundamental frequency, energy, and speech rate disturbances, among others, and intelligibility measures based on the word error rate.
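As a rough illustration, here is a minimal sketch of two of these classical measures, assuming the librosa library and a hypothetical recording "sample.wav"; it is not the feature extraction pipeline of the related studies.

```python
# Minimal sketch of two classical measures, assuming librosa and a
# hypothetical file "sample.wav"; not the pipeline used in the paper.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)

# Fundamental frequency contour via the YIN algorithm (prosody/phonation).
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)

# Short-time root-mean-square energy (prosody).
energy = librosa.feature.rms(y=y)[0]

# Simple functionals often used as utterance-level descriptors.
print("F0 mean/std:", np.mean(f0), np.std(f0))
print("Energy mean/std:", np.mean(energy), np.std(energy))
```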
So, current trends in pathological speech modeling can be divided, or I divided them, into three main aspects. The first one is based on speaker models using Gaussian mixture models, i-vectors, or the recent x-vectors; some of these models were already used in my recent paper at ICASSP. The second one is phonological features, regarding the estimation and prediction of posterior probabilities for phonological classes like plosives, nasals, and fricatives, among others. And the third one, which is the method I would like to talk to you about today, is representation learning strategies: mainly embeddings derived from a neural network to represent pathological speech signals. These methods are mainly inspired by the natural language processing
community. So, Alexa is going to help me during this slide. Alexa, open presentation. Open
presentation. Hi, what do you want to know about? Methods. We proposed a novel strategy
based on unsupervised representation learning for automatic classification of pathological
speech. For such a purpose, we trained recurrent and convolutional autoencoders to extract
informative features to characterize the presence of speech disorders. We additionally proposed
a novel feature set based on the reconstruction error of the autoencoders. We think this can
be very good. So, thank you, Alexa. So, we propose... Alexa, stop.

So, in the first case, we propose a convolutional autoencoder with the aim of mapping the spatial distribution of the energy which is present in a spectrogram. The input of the autoencoder is a Mel-scale spectrogram with 128 filters and a time frame of 500 milliseconds. We consider the bottleneck representation here to provide a suitable representation to reconstruct the input spectrogram.
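A minimal sketch of this convolutional autoencoder idea, assuming PyTorch; the layer sizes, the bottleneck dimension, and the 50-frame window (500 ms at an assumed 10 ms hop) are illustrative, not the exact architecture of the paper.

```python
# Minimal sketch of the convolutional autoencoder idea, assuming PyTorch.
# Input: one Mel spectrogram chunk of shape (1, 128, 50), i.e. 128 Mel
# filters over ~500 ms (assuming a 10 ms hop). Sizes are illustrative.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, bottleneck=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # (1,128,50) -> (16,64,25)
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # -> (32,32,13)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 32 * 13, bottleneck),        # bottleneck representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 32 * 32 * 13),
            nn.Unflatten(1, (32, 32, 13)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=(1, 0)),  # -> (16,64,25)
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1),       # -> (1,128,50)
            nn.Sigmoid(),  # assumes spectrograms normalized to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)              # bottleneck features
        return self.decoder(z), z

# Example: reconstruct a batch of 8 random "spectrograms".
model = ConvAutoencoder()
recon, z = model(torch.rand(8, 1, 128, 50))
```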
In the second case, we have a recurrent autoencoder with the aim of modeling the temporal evolution of the spectral components that are present in a speech frame. The input is the same as in the previous case: a Mel-scale spectrogram with 128 filters and a time frame of 500 milliseconds. In this case, we consider as well the bottleneck features to represent the speech signal.
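Again a minimal sketch, assuming PyTorch: the spectrogram is read as a sequence of 128-dimensional Mel frames, and the final hidden state of an encoder GRU serves as the bottleneck; the choice of GRU and the hidden size are illustrative assumptions.

```python
# Minimal sketch of the recurrent autoencoder idea, assuming PyTorch.
# The spectrogram is read as a sequence of 50 frames, each a 128-dim
# Mel vector; the GRU and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    def __init__(self, n_mels=128, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, x):                # x: (batch, frames, 128)
        _, h = self.encoder(x)           # h: (1, batch, hidden)
        z = h[-1]                        # bottleneck features, (batch, hidden)
        # Feed the bottleneck at every step and decode the frame sequence.
        rep = z.unsqueeze(1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder(rep)
        return self.out(dec), z          # reconstruction, bottleneck

model = RecurrentAutoencoder()
recon, z = model(torch.rand(8, 50, 128))
```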
From both autoencoders, we propose two different feature sets to evaluate the presence of speech disorders. The first one, as classically addressed, is the bottleneck features. But we propose an additional feature set based on the mean squared reconstruction error of the autoencoders.
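To make the two feature sets concrete, here is a minimal sketch reusing the ConvAutoencoder from the earlier snippet; the helper name extract_features is hypothetical, not from the paper.

```python
# Minimal sketch of the two feature sets, reusing the hypothetical
# ConvAutoencoder from the earlier snippet.
import torch

def extract_features(model, mel_batch):
    """mel_batch: (batch, 1, 128, frames), normalized as in training."""
    model.eval()
    with torch.no_grad():
        recon, bottleneck = model(mel_batch)
        # Feature set 1: the bottleneck representation itself.
        # Feature set 2: mean squared reconstruction error per window.
        mse = ((mel_batch - recon) ** 2).mean(dim=(1, 2, 3))
    return bottleneck, mse
```

One way to read this second feature set is that the reconstruction error measures how far a given segment lies from the speech the autoencoder was trained on, which may itself carry information about the presence of a disorder.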